ABSTRACT 

We present and describe a catalog of galaxy photometric redshifts (photo-z's) for the Sloan Digital 
Sky Survey (SDSS) Data Release 6 (DR6). We use the Artificial Neural Network (ANN) technique 
to calculate photo-z's and the Nearest Neighbor Error (NNE) method to estimate photo-z errors 
for ~77 million objects classified as galaxies in DR6 with r < 22. The photo-z and photo-z error 
estimators are trained and validated on a sample of ~640,000 galaxies that have SDSS photometry 
and spectroscopic redshifts measured by SDSS, 2SLAQ, CFRS, CNOC2, TKRS, DEEP, and DEEP2. 
For the two best ANN methods we have tried, we find that 68% of the galaxies in the validation set 
have a photo-z error smaller than $\sigma_{68}$ = 0.021 or 0.024. After presenting our results and quality tests, 
we provide a short guide for users accessing the public data. 

Subject headings: photometric redshifts sdss - Sloan Digital Sky Survey 



1. INTRODUCTION 

While spectroscopic redshifts have now been measured 
for over one million galaxies, in recent years digital sky 
surveys have obtained multi-band imaging for of order 
a hundred million galaxies. Deep, wide-area surveys 
planned for the next decade will increase the number 
of galaxies with multi-band photometry to a few billion. 
Due to technological and financial constraints, obtain- 
ing spectroscopic redshifts for more than a small fraction 
of these galaxies will remain impractical for the foresee- 
able future. As a result, over the last decade substan- 
tial effort has gone into developing photometric redshift 
(photo-z) techniques, which use multi-band photometry 
to estimate approximate galaxy redshifts. For many ap- 
plications in extragalactic astronomy and cosmology, the 
resulting photometric redshift precision is sufficient for 
the science goals at hand, provided one can accurately 
characterize the uncertainties in the photo-z estimates. 

Two broad categories of photo-z estimators are in 
wide use: template-fitting and training set methods. 
In template-fitting, one assigns a redshift to a galaxy 
by finding the redshifted spectral energy distribution 
(SED), selected from a library of templates, that best 
reproduces the observed fluxes in the broadband fil- 
ters. By contrast, in the training set approach, 
one uses a training set of galaxies with spectroscopic 
redshifts and photometry to derive an empirical relation 
between photometric observables (e.g., magnitudes, colors, 
and morphological indicators) and redshift. Examples of 
empirical methods include Polynomial Fitting (Connolly et al. 
1995), the Nearest Neighbor method (Csabai et al. 2003), the 
Nearest Neighbor Polynomial (NNP) technique (Cunha et al. 
2007), Artificial Neural Networks (ANN) (Collister & Lahav 
2004; Vanzella et al. 2004; d'Abrusco et al. 2007), and 
Support Vector Machines (Wadadekar 2005). When a large 
spectroscopic training set that is representative of the 
photometric data set to be analyzed is available, training 
set techniques typically outperform template-fitting 
methods, in the sense that the photo-z estimates have 
smaller scatter and bias with respect to the true redshifts 
(Cunha et al. 2007). On the other hand, template-fitting 
can be applied to a photometric sample for which 
relatively few spectroscopic analogs exist. For a comprehensive 
review and comparison of photo-z methods, see 
Cunha et al. (2007). 

In this paper, we present a publicly available galaxy 
photometric redshift catalog for the Sixth Data Release 
(DR6) of the Sloan Digital Sky Survey (SDSS) imaging 
catalog (Blanton et al. 2003; Eisenstein et al. 2001; 
Gunn et al. 1998; Ivezic et al. 2004; Strauss et al. 2002; 
York et al. 2000). We use the ANN photo-z method, 
which we have shown to be a superior training set method 
(Cunha et al. 2007), and briefly compare the results using 
different photometric observables. We also compare 
the ANN results with those from NNP, an empirical 
method which achieves similar performance to the ANN 
method (Cunha et al. 2007). Since the SDSS photometric 
catalog covers a large area of sky, a number of deep 
spectroscopic galaxy samples with SDSS photometry are 
available to use as training sets, as shown in Fig. 1. In 
combination, these spectroscopic samples cover the full 
apparent magnitude range of the SDSS photometric sam- 
ple. 

The paper is organized as follows. In §2 we briefly 
describe the SDSS DR6 photometric catalog and the 
selection criteria used to obtain the galaxy photometric 
sample from the catalog. In §3 we describe the 
spectroscopic catalogs used to construct the photo-z training 
and validation sets. In §4 we outline the photo-z 
methods as well as the photo-z error estimator technique 
applied to the galaxy sample. Statistical results for 
photometric redshift performance, errors, and redshift 
distributions are presented in §5. In §6 we make 
recommendations for possible additional cuts on the photo-z 
catalog based on our own flags and those in the SDSS 
database. In §7 we briefly describe how to access the 
photo-z catalog from the public SDSS data server, and 
in §8 we present our conclusions. For completeness, 
Appendix A provides the database query used to select the 
photometric sample, Appendix B discusses issues of 
star-galaxy separation, and Appendix C briefly describes an 
earlier version of the photo-z algorithm used for SDSS 
DR5 (Adelman-McCarthy et al. 2007a). 

2. SDSS PHOTOMETRIC CATALOG AND GALAXY 
SELECTION 

The SDSS comprises a large-area imaging survey of 
the north Galactic cap, a multi-epoch imaging survey 
of an equatorial stripe in the south Galactic cap, and 
a spectroscopic survey of roughly $10^6$ galaxies and $10^5$ 
quasars (York et al. 2000). The survey uses a dedicated, 
wide-field, 2.5m telescope (Gunn et al. 2006) at 
Apache Point Observatory, New Mexico. Imaging is 
carried out in drift-scan mode using a 142 mega-pixel 
camera (Gunn et al. 2006) that gathers data in five 
broad bands, ugriz, spanning the range from 3,000 to 
10,000 Å (Fukugita et al. 1996), with an effective 
exposure time of 54.1 seconds per band. The images 
are processed using specialized software (Lupton et al. 
2001; Stoughton et al. 2002) and are astrometrically 
(Pier et al. 2003) and photometrically (Hogg et al. 2001; 
Tucker et al. 2006) calibrated using observations of a set 
of primary standard stars (Smith et al. 2002) observed 
on a neighboring 20-inch telescope. 

The imaging in the sixth SDSS Data Release (DR6) 
covers an essentially contiguous region of the north 
Galactic cap, with only a few small patches remaining 
to be observed. In any region where imaging runs overlap, 
one run is declared primary$^1$ and is used for 
spectroscopic target selection; other runs are declared 
secondary. The area covered by the DR6 primary imaging 
survey, including the southern stripes, is 8417 deg$^2$, but 
DR6 includes both the primary and secondary observations 
of each area and source (Adelman-McCarthy et al. 
2007b). 

The SDSS database provides a variety of measured 
magnitudes for each detected object. Throughout this 
paper, we use dereddened model magnitudes to perform 
the photometric redshift computations. To determine 
the model magnitude, the SDSS photometric pipeline fits 
two models to the image of each galaxy in each passband: 
a de Vaucouleurs (early-type) and an exponential 
(late-type) light profile. The models are convolved with the 
estimated point spread function (PSF), with arbitrary 
axis ratio and position angle. The best-fit model in the 
r band (which is used to fix the model scale radius) is 
then applied to the other passbands and convolved with 
the passband-dependent PSFs to yield the model 
magnitudes. Model magnitudes provide an unbiased color 
estimate in the absence of color gradients (Stoughton et al. 
2002), and the dereddening procedure removes the effect 
of Galactic extinction (Schlegel et al. 1998). 

To construct the photometric sample of galaxies 
for which we wish to estimate photo-z's, we obtained 
a catalog drawn from the SDSS CasJobs website, 
http://casjobs.sdss.org/casjobs/. We checked 
some of the SDSS photometric flags to ensure that we 
have obtained a reasonably clean galaxy sample. In 
particular, we selected all primary objects from DR6 that 
have the TYPE flag equal to 3 (the type for galaxy) 
and that do not have any of the flags BRIGHT, 
SATURATED, or SATUR_CENTER set. For the definitions of 
these flags we refer the reader to the PHOTO flags entry 
at the SDSS website$^2$ or to Appendix A. We also took 
into account the nominal SDSS flux limit (see Table 1) 
by only selecting galaxies with dereddened model 
magnitude r < 22.0. The full database query we used is given 
in Appendix A. 

1 For the precise definition of primary objects see 
http://cas.sdss.org/dr6/en/help/docs/glossary.asp#P 

2 http://cas.sdss.org/dr6/en/help/browser/browser.asp 

Fig. 1. - Normalized r magnitude distributions for various 
catalogs. Top three rows: the distributions of the spectroscopic 
catalogs used for photo-z training and validation are shown for 2SLAQ, 
CFRS, CNOC2, TKRS, DEEP and DEEP2, and the SDSS 
spectroscopic sample. $N_{\rm tot}$ denotes the total number of galaxy 
measurements used from each catalog; for galaxies in regions with 
repeat SDSS imaging, each independent photometric measurement 
is counted separately. Bottom row: (left) the distribution of the 
combined spectroscopic sample; (right) the distribution for the 
SDSS photometric galaxy sample, where objects were classified as 
galaxies according to the photometric TYPE flag (see text). 

TABLE 1 
Photometric Sample Properties 

AB magnitude limits      RMS photometric calibration errors 
u    22.0                r        2% 
g    22.2                u - g    3% 
r    22.2                g - r    2% 
i    21.3                r - i    2% 
z    20.5                i - z    3% 

Note. - Magnitude limits are for 95% completeness 
for point sources in typical seeing; 50% 
completeness numbers are generally 0.4 mag fainter 
(Adelman-McCarthy et al. 2007a). The median seeing for 
the SDSS imaging survey is 1.4". 

Fig. 2. - Distribution of g - r and r - i colors for different 
SDSS samples. Top row: the color distributions for galaxies in the 
SDSS spectroscopic sample. Middle row: the color distributions for 
galaxies in the other (non-SDSS) spectroscopic training samples. 
Bottom row: the color distributions for galaxies in the photometric 
sample. As above, galaxy/star classification used the photometric 
TYPE flag. 

The photometric galaxy catalog we have selected suf- 
fers from impurity and incompleteness at some level, 
since the photometric pipeline cannot separate stars from 
galaxies with 100% success at faint magnitudes. We 
describe some of our tests of star/galaxy separation in 
Appendix B, where we show that the SDSS TYPE flag 
provides star/galaxy separation performance similar to 
other methods. 

The final photometric sample comprises 77,418,767 
galaxies. The r magnitude distribution of this sample 
is shown in the bottom right panel of Fig. 1; the g - r 
and r - i color distributions are shown in the bottom 
panels of Fig. 2. 

3. SPECTROSCOPIC TRAINING AND VALIDATION SETS 

Since our methods to estimate photo-z's and photo-z 
errors are training-set based, we would ideally like the 
spectroscopic training set to be fully representative of 
the photometric sample to be analyzed, i.e., to have 
similar statistical properties and magnitude/redshift 
distributions. Training-set methods can be thought of as 
inherently Bayesian, in the sense that the training-set 
distributions form effective priors for the analysis of the 
photometric sample; to the extent that the training-set 
distributions reflect those of the photometric sample, we 
may expect the photo-z estimates to be unbiased (or at 
least they will not be biased by the prior). Given the 
practical difficulties of carrying out spectroscopy at faint 
magnitudes and low surface brightness, such an ideal 
generally cannot be achieved. Realistically, all we can hope 
for is a training set that (a) is large enough that 
statistical fluctuations are small and (b) spans the same 
magnitude, color, and redshift ranges as the photometric 
sample. Fortunately, our tests indicate that the 
estimated photo-z's depend only weakly on the shape of the 
redshift and magnitude distributions of the training set 
for the SDSS. 

We have constructed a spectroscopic sample consisting 
of 639,911 galaxies that have SDSS photometry 
measurements (counting repeats; see below) and that have 
spectroscopic redshifts measured by the SDSS or by other 
surveys, as described below. We imposed a magnitude 
limit of r < 23.0 on the spectroscopic sample and 
applied additional cuts on the quality of the spectroscopic 
redshifts reported by the different surveys. Since we 
impose a limit of r < 22.0 for the SDSS photometric 
sample, the fainter limit chosen for the spectroscopic 
training sample accommodates the full photometric range of 
interest without creating boundary effects for photo-z's 
of galaxies with magnitudes near the photometric 
sample limit of r = 22. Each survey providing 
spectroscopic redshifts defines a redshift quality indicator; we 
refer the reader to the respective publications listed 
below for their precise definitions. For each survey, we 
chose a redshift quality cut roughly corresponding to 
90% redshift confidence or greater. The SDSS 
spectroscopic sample provides 531,672 redshifts, principally 
from the MAIN and Luminous Red Galaxy (LRG) 
samples, with confidence level $z_{\rm conf} > 0.9$. The remaining 
redshifts are: 21,123 from the Canadian Network for 
Observational Cosmology Field Galaxy Survey (CNOC2; 
Yee et al. 2000), 1,830 from the Canada-France Redshift 
Survey (CFRS; Lilly et al. 1995) with Class > 1, 31,716 
from the Deep Extragalactic Evolutionary Probe (DEEP; 
Davis et al. 2001) with $q_z$ = A or B and from DEEP2 
(Weiner et al. 2005)$^3$ with $z_{\rm quality} \geq 3$, 728 from the 
Team Keck Redshift Survey (TKRS; Wirth et al. 2004) 
with $z_{\rm quality} > -1$, and 52,842 LRGs from the 2dF-SDSS 
LRG and QSO Survey (2SLAQ; Cannon et al. 2006)$^4$ 
with $z_{\rm op} > 3$. 

We positionally matched the galaxies with 
spectroscopic redshifts against photometric data in the SDSS 
BestRuns CAS database, which allowed us to match 
with photometric measurements in different SDSS 
imaging runs. The above numbers for galaxies with 
redshifts count independent photometric measurements of 
the same objects due to multiple SDSS imaging of the 
same region; in particular SDSS Stripe 82 has been 
imaged a number of times. The numbers of unique 
galaxies used from these surveys are 1,435 from CNOC2, 
272 from CFRS, 6,049 from DEEP and DEEP2, 389 
from TKRS, and 11,426 from 2SLAQ. The SDSS 
spectroscopic samples were drawn from the SDSS primary 
galaxy sample and therefore are all unique. The 
spectroscopic sample obtained by combining all these catalogs, 
including the repeats, was divided into two catalogs of 
the same size (~320,000 objects each). One of these 
catalogs was taken to be the training set used by the photo-z 
and error estimators, and the other was used as a 
validation set to carry out tests of photo-z quality (see §4.1). 
Our tests indicate that this procedure of treating 
different images of the same training/validation set galaxies 
as independent objects leads to good results, provided 
all the photometric measurements for a given object are 
confined to either the training set or the validation set 
and not mixed. By contrast, excluding such multiple 
images from the spectroscopic sample would result in much 
smaller training and validation sets; these would be very 
sparse at faint magnitudes, leading to much diminished 
photo-z quality there. On the other hand, splitting the 
repeat images of a given object between the training and 
validation sets may result in "over-fitting" of the derived 
photo-z's (see §4.1). 

3 http://deep.berkeley.edu/DR2/ 

4 http://lrg.physics.uq.edu.au/New_dataset2/ 

Fig. 3. - A simple FFMP network with 3 layers and configuration 2:1:1. The inputs are the two magnitudes, $m_1$ and $m_2$. $I_x$ denotes 
the input from node x, and $O_x$ is the corresponding output of this node. The weights $w$ associated with each connection are found by 
training the network using training and validation sets (see text). 

The r-magnitude and color (g - r and r - i) 
distributions for the spectroscopic samples and for the 
photometric sample are shown in Figs. 1 and 2. While the 
magnitude and color distributions of the combined 
spectroscopic sample are not identical to those of the 
photometric sample, the spectroscopic sample does span the 
range of apparent magnitude and color of the 
photometric sample. To test the impact of having a training set 
that is not fully representative of the photometric 
sample, we divided the spectroscopic sample into smaller, 
alternate training and validation sets. For instance, to 
test the effect of the training-set magnitude 
distribution on the photo-z estimates, we created a training set 
with a flat r magnitude distribution and another with 
an r magnitude distribution similar to that of the 
photometric sample. Our tests indicated that the photo-z 
quality is not strongly affected by the magnitude 
distribution of the training set. The changes in the photo-z 
performance metrics (the rms scatter and the 68% 
CL region, defined below in §5.1) were smaller than 10% 
when the training-set magnitude distribution was varied 
between these different choices. Since using the entire 
spectroscopic sample for the training and validation sets 
produced marginally better results than all other cases 
tested, we have adopted this as our final choice. In 
addition, we tested the effect of the size of the training set 
on our photo-z calculations. We found that the photo-z 
performance metrics defined in §5.1 are degraded by no 
more than 10% when the training set is artificially 
reduced to 10% of its original size. Even when the training 
set is reduced to ~1% of its original size, the photo-z 
performance metrics are degraded by less than 25%. 
This gives us confidence that the spectroscopic training 
set size used here is sufficient for extracting nearly 
optimal photo-z estimates. 
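
The requirement that all repeat measurements of one galaxy stay on the same side of the training/validation split can be illustrated with a short sketch. This is not the code used for the catalog: the function name `split_by_object`, the integer object identifiers, and the random assignment are assumptions made for illustration. Note also that it splits unique objects in half, which only approximately halves the number of measurements.

```python
import numpy as np

def split_by_object(object_ids, train_fraction=0.5, seed=0):
    """Return boolean masks (train, valid) over measurements such that
    every measurement of a given object lands in exactly one set."""
    object_ids = np.asarray(object_ids)
    unique_ids = np.unique(object_ids)
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(unique_ids)
    n_train = int(round(train_fraction * len(shuffled)))
    # Assign whole objects, not individual measurements, to the training set.
    in_train = np.isin(object_ids, shuffled[:n_train])
    return in_train, ~in_train

# Hypothetical example: object 7 was imaged three times, object 9 twice.
ids = np.array([7, 7, 7, 8, 9, 9, 10])
train_mask, valid_mask = split_by_object(ids)
```

Splitting on object identifiers rather than on rows is what prevents a repeat image of a training galaxy from appearing in the validation set.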



4. METHODS 

4.1. ANN and NNP Photometric redshifts 

The ANN method that we use to estimate galaxy 
photo-z's is a general classification and interpolation tool 
used successfully in an array of fields such as 
handwriting recognition, automatic aircraft piloting$^5$, detecting 
credit card fraud$^6$, and extracting astronomically 
interesting sources in a telescope image (Bertin & Arnouts 
1996). 

We use a particular type of ANN called a Feed Forward 
Multilayer Perceptron (FFMP) to map the 
relationship between photometric observables and redshifts. 
An FFMP network consists of several input nodes, one or 
more hidden layers, and several output nodes, all 
interconnected by weighted connections (see Fig. 3). We 
follow the notation of Collister & Lahav (2004) and denote 
a network with $N_i$ input nodes, $N_{hj}$ nodes in hidden layer 
$j$, and $N_o$ output nodes as $N_i : N_{h1} : N_{h2} : \ldots : N_{hm} : N_o$. 
For each input object, the input photometric data (e.g., 
magnitudes, colors, concentrations, etc.) are fed into the 
input nodes of the FFMP, which fire signals according 
to the values of the input data. Each node in a hidden 
layer receives a total input which is a weighted sum of 
the outputs from the nodes in the previous layer, i.e., 
node $i$ in a hidden layer receives an input $I_i$ given by 

$$I_i = \sum_j w_{ij} O_j , \qquad (1)$$ 

where $O_j$ is the output of the $j$-th node of the previous 
layer and $w_{ij}$ is the weight of the connection between 
node $i$ in the hidden layer and node $j$ in the previous 
layer. Given the input $I_i$, the output $O_i$ of node $i$ is a 
function $f$ of the input, 

$$O_i = f(I_i) , \qquad (2)$$ 

where $f$ is the activation function. Repeating this 
process, signals propagate up to the output nodes. The 
activation function is typically a sigmoid function: 

$$f(I_i) = \frac{1}{1 + e^{-I_i}} . \qquad (3)$$ 

However, there are various alternatives, such as step 
functions and hyperbolic tangents. Vanzella et al. (2004) 
show that the choice of activation functions makes no 
significant difference in the result. 

5 http://www.nasa.gov/centers/dryden/news/NewsReleases/2003/03-49.html 

6 http://www.visa.ca/en/about/visabenefits/innovation.cfm 

We use X:20:20:20:1 networks to estimate photo-z's, 
where X is the number of input photometric parame- 
ters per galaxy. The corresponding number of degrees 
of freedom (the number of weights) is roughly 1,000, de- 
pending on the actual value of X. We use hyperbolic 
tangent functions as the activation function of the hid- 
den layers and a linear activation function for the output 
layer. 
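
The weight count quoted above can be checked with a few lines of arithmetic, and the same layer sizes give a compact forward pass. The sketch below is illustrative only: the helper names `count_weights` and `forward` are hypothetical, the weights are random rather than trained, and bias terms are left out because the text does not say whether the network uses them.

```python
import numpy as np

def count_weights(layer_sizes):
    """Number of connection weights between consecutive layers."""
    return sum(a * b for a, b in zip(layer_sizes[:-1], layer_sizes[1:]))

def forward(x, weights):
    """Propagate inputs through the network: tanh activation in the
    hidden layers, linear activation in the output layer."""
    out = np.asarray(x, dtype=float)
    for w in weights[:-1]:
        out = np.tanh(out @ w)          # I_i = sum_j w_ij O_j, then O_i = f(I_i)
    return (out @ weights[-1]).ravel()  # linear output node

layer_sizes = [5, 20, 20, 20, 1]        # X = 5, e.g. the five ugriz magnitudes
n_weights = count_weights(layer_sizes)  # 5*20 + 20*20 + 20*20 + 20*1

rng = np.random.default_rng(0)
weights = [rng.normal(scale=0.1, size=(a, b))
           for a, b in zip(layer_sizes[:-1], layer_sizes[1:])]
mags = rng.uniform(17.0, 22.0, size=(3, 5))  # three made-up galaxies
z_out = forward(mags, weights)
```

For X = 5 this gives 920 connection weights, and for X = 10 it gives 1,020, consistent with the figure of roughly 1,000.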

Despite the occasional aura of mystery surrounding 
neural networks, an FFMP is nothing more than a com- 
plex mathematical function; in fact, one can always write 
down the analytic expression corresponding to a neural 
network function. 

Once the network configuration is specified, it can be 
trained to output an estimate of redshift given the input 
photometric observables. The training process involves 
finding the set of weights $w_{ij}$ that minimize a score 
function $E$, chosen here to be 

$$E = \sum_i \left( z_{\rm spec}^i - z_o^i \right)^2 , \qquad (4)$$ 

where $z_{\rm spec}^i$ is the measured spectroscopic redshift, $z_o^i$ is 
the output redshift of the output node, and the sum is 
over all galaxies in the training set. Note that the choice 
of score function is not unique, and different choices will 
in general lead to different photo-z estimates. The 
minimization of this score function can be done efficiently 
because its derivatives with respect to the weights are 
available analytically. We use a Variable Metric method 
as described in Press et al. (1992) for the minimization. 

In machine learning, over-fitting refers to the tendency 
of an algorithm with many adjustable parameters to fit to 
the noise in the training set data. In order to avoid over- 
fitting, we use the technique of early stopping. The spec- 
troscopic sample is divided into two independent subsets, 
the training and validation sets, and the formal mini- 
mizations are done using the training set. After each 
minimization step, the network is evaluated on the vali- 
dation set, and the set of weights that performs best on 
the validation set is chosen as the final set. Another is- 
sue in machine learning is that minimization procedures 
that start at different initial choices of weights generally 
end at different local minima of the score function. To 
reduce the chance of ending in a less-than-optimal local 
minimum, we minimize five networks starting at differ- 
ent positions in the space of weights. Among these, we 
choose the network that gives the lowest photo-z scatter 
(cf. Eq. 8) in the validation set. For more details of 
our implementation of the ANN and its performance on 
mock catalogs and real data, see Cunha et al. (2007). 
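
Early stopping combined with several restarts can be sketched on a toy problem. The example below is a simplified stand-in and not the catalog's implementation: it uses a single tanh hidden layer and plain gradient descent instead of the Variable Metric minimizer and the X:20:20:20:1 network, and the function names, learning rate, step count, and synthetic data are all assumptions.

```python
import numpy as np

def train_once(x_tr, z_tr, x_va, z_va, n_hidden=8, n_steps=300, lr=0.05, seed=0):
    """Train a small network from one random start and return the
    (validation rms, w1, w2) that performed best on the validation set."""
    rng = np.random.default_rng(seed)
    w1 = rng.normal(scale=0.5, size=(x_tr.shape[1], n_hidden))
    w2 = rng.normal(scale=0.5, size=n_hidden)
    best = (np.inf, w1.copy(), w2.copy())
    for _ in range(n_steps):
        h = np.tanh(x_tr @ w1)               # hidden-layer outputs
        resid = h @ w2 - z_tr                # z_out - z_spec on the training set
        grad_w2 = h.T @ resid / len(z_tr)
        grad_w1 = x_tr.T @ (np.outer(resid, w2) * (1.0 - h ** 2)) / len(z_tr)
        w1 -= lr * grad_w1
        w2 -= lr * grad_w2
        # Early stopping: score every step on the validation set and
        # remember the weights that did best there.
        rms_va = np.sqrt(np.mean((np.tanh(x_va @ w1) @ w2 - z_va) ** 2))
        if rms_va < best[0]:
            best = (rms_va, w1.copy(), w2.copy())
    return best

def train_with_restarts(x_tr, z_tr, x_va, z_va, n_restarts=5):
    """Train from several initial weight choices; keep the lowest validation scatter."""
    runs = [train_once(x_tr, z_tr, x_va, z_va, seed=s) for s in range(n_restarts)]
    return min(runs, key=lambda run: run[0])

# Synthetic data: two made-up observables plus a constant column acting as a bias.
rng = np.random.default_rng(42)
x = np.column_stack([rng.uniform(-1.0, 1.0, size=(400, 2)), np.ones(400)])
z = 0.3 + 0.2 * x[:, 0] - 0.1 * x[:, 1] ** 2 + rng.normal(scale=0.02, size=400)
x_tr, z_tr, x_va, z_va = x[:200], z[:200], x[200:], z[200:]
best_rms, best_w1, best_w2 = train_with_restarts(x_tr, z_tr, x_va, z_va)
```

The returned weights are those from the step with the lowest validation scatter, not those from the final step, which is the essence of early stopping.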

The ANN photo-z algorithm is very flexible in the sense 
that it is easy to change the input parameters, the train- 
ing set, and the network configurations. We tried a vari- 
ety of combinations of possible input photometric observ- 
ables to see their effects on photo-z quality. We calcu- 
lated photo-z's using galaxy magnitudes, colors, and the 
concentration indices for some or all of the passbands. 
The concentration index $c_i$ in passband $i$ is defined as the 
ratio of PetroR50 and PetroR90, which are the radii that 
encircle 50% and 90% of the Petrosian flux, respectively. 
Early-type (E and S0) galaxies, with centrally peaked 
surface brightness profiles, tend to have low values of the 
concentration index, while late-type spirals, with 
quasi-exponential light profiles, typically have higher values 
of $c$. Previous studies (Morgan 1958; Shimasaku et al. 
2001; Yamauchi et al. 2005; Park & Choi 2005) have 
shown that the concentration parameter correlates well 
with galaxy morphological type, and we used it to help 
break the degeneracy between redshift and galaxy type. 
We present the photo-z results for different combinations 
of input parameters in §5. 

For comparison, we also computed photo-z's for the 
validation set using another empirical method, the 
Nearest Neighbor Polynomial (NNP) technique (Cunha et al. 
2007). In NNP, to derive a photo-z for a galaxy in the 
photometric sample, we look for its training-set nearest 
neighbors in the space of photometric observables 
(magnitudes, colors, etc.). Suppose we have $N_d$ photometric 
data entries for each galaxy. The data vector for the 
galaxy of interest in the photometric sample is denoted 
by $\mathbf{D} = (D^1, D^2, \ldots, D^{N_d})$, while the data vector for the 
$i$-th galaxy in the training set is $\mathbf{D}_i = (D_i^1, D_i^2, \ldots, D_i^{N_d})$. 
The distance $d_i$ between the photometric object and the 
$i$-th training set galaxy is defined using a flat metric in 
data space, 

$$d_i^2 = \sum_{a=1}^{N_d} \left( D^a - D_i^a \right)^2 . \qquad (5)$$ 

The nearest neighbors are the training-set objects for 
which $d_i$ is minimum. Once the nearest neighbors for a 
given galaxy are identified, they are used to fit the 
coefficients of a local, low-order polynomial relation between 
photometric observables and redshift. The galaxy photo-z 
is then obtained by applying the derived relation to the 
photometric object. 

For the NNP method employed in this work, we take 
the photometric data in Eq. (5) to be the four 
"adjacent" galaxy colors u - g, g - r, r - i, i - z; we 
found that this choice produces results marginally better 
than using the galaxy magnitudes. We use the 
nearest 1000 neighbors to fit a quadratic polynomial relation 
between redshift and the photometric data, here chosen 
to be the five magnitudes in each passband (ugriz) and 
their corresponding concentration indices. We note that 
Wang et al. (2007) used a similar technique to estimate 
photo-z's for a small sample of SDSS spectroscopic 
galaxies. They applied the Kernel Regression method of order 
0, weighting the training-set neighbors and computing 
photo-z's by using the weighted average of the neighbors' 
redshifts. Our NNP method is closer to a Kernel 
Regression of order 2, since we perform quadratic fits; however, 
we do not apply variable weights to the neighbors but 
treat them equally in the fit. 
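
The two NNP steps, a neighbor search followed by an unweighted local quadratic fit, can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the names `quadratic_design` and `nnp_photoz` are hypothetical, the neighbor search is brute force, and centering the fit variables on the neighbor mean is a numerical convenience not mentioned in the text.

```python
import numpy as np

def quadratic_design(x):
    """Design matrix with columns 1, x_a, and all products x_a * x_b (a <= b)."""
    x = np.atleast_2d(x)
    n, d = x.shape
    cols = [np.ones(n)] + [x[:, a] for a in range(d)]
    cols += [x[:, a] * x[:, b] for a in range(d) for b in range(a, d)]
    return np.column_stack(cols)

def nnp_photoz(target_search, target_fit, train_search, train_fit, train_z, k=1000):
    """Estimate a photo-z from a quadratic fit to the k nearest training galaxies.

    *_search: observables used for the neighbor search (colors in the text).
    *_fit:    observables used in the polynomial (magnitudes and concentrations).
    """
    d2 = np.sum((train_search - target_search) ** 2, axis=1)  # flat metric, Eq. (5)
    k = min(k, len(train_z))
    nearest = np.argpartition(d2, k - 1)[:k]
    center = train_fit[nearest].mean(axis=0)                   # assumption: centering
    design = quadratic_design(train_fit[nearest] - center)
    coeffs, *_ = np.linalg.lstsq(design, train_z[nearest], rcond=None)
    return float((quadratic_design(target_fit - center) @ coeffs)[0])

# Toy check: the "redshift" is an exact quadratic of two made-up observables.
rng = np.random.default_rng(1)
train_x = rng.uniform(0.0, 1.0, size=(500, 2))
train_z = 0.1 + 0.05 * train_x[:, 0] + 0.02 * train_x[:, 1] ** 2
target = np.array([0.4, 0.6])
z_est = nnp_photoz(target, target, train_x, train_x, train_z, k=50)
```

With ten fit variables (five magnitudes and five concentration indices), a full quadratic has 1 + 10 + 55 = 66 coefficients, so 1000 neighbors leave the fit well constrained.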

Whereas the ANN method provides a single, nonlinear, 
global fit using the whole training set and applies the 
derived photo-z relation to all photometric objects, the 
NNP method yields a separate, linear (in parameters), 
local fit for each photometric object using its neighbors. 
If the galaxy magnitude-concentration-redshift 
hypersurface is a differentiable manifold, i.e., if it can be locally 
approximated by a hyperplane even though it is 
globally curved, then these two photo-z methods should be 
roughly equivalent. Indeed, as we show in §5, their 
performance is very similar. 

4.2. Photometric redshift errors 

We estimated photo-z errors for objects in the 
photometric catalog using the Nearest Neighbor Error (NNE) 
estimator (Oyaizu et al. 2007). The NNE method is 
training-set based, with a neighbor selection similar to 
the NNP photo-z estimator; it associates photo-z errors 
to photometric objects by considering the errors for ob- 
jects with similar multi-band magnitudes in the valida- 
tion set. We use the validation set, because the photo-z's 
of the training set could be over-fit, which would result 
in NNE underestimating the photo-z errors. 

The procedure to calculate the redshift error for a 
galaxy in the photometric sample is as follows. We find 
the validation-set nearest neighbors to the galaxy of 
interest. In contrast to NNP, where the distance in Eq. (5) 
was defined in color space, the NNE distance is defined in 
magnitude space, since photo-z errors correlate strongly 
with magnitude. Since the selected nearest neighbors are 
in the spectroscopic sample, we know their photo-z 
errors, $\delta z = z_{\rm phot} - z_{\rm spec}$, where $z_{\rm phot}$ is computed using the 
ANN or the NNP method. We calculated the 68% width 
of the $\delta z$ distribution for the neighbors and assigned that 
number as the photo-z error estimate for the 
photometric galaxy. Here we selected the nearest 200 neighbors 
of each object to estimate its photo-z error. In 
studies of photo-z error estimators applied to mock and real 
galaxy catalogs, we found that NNE accurately predicts 
the photo-z error when the training set is representative 
of the photometric sample (Oyaizu et al. 2007). 
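
A minimal sketch of the NNE procedure follows. It is not the authors' implementation: the function name `nne_error` and the synthetic validation set are hypothetical, the neighbor search is brute force, and the "68% width" is taken to be the value of |δz| that encloses 68% of the neighbors, by analogy with the $\sigma_{68}$ metric of §5.1.

```python
import numpy as np

def nne_error(target_mags, valid_mags, valid_zphot, valid_zspec, k=200):
    """Photo-z error estimate for one galaxy from its k nearest
    validation-set neighbors in magnitude space."""
    d2 = np.sum((valid_mags - target_mags) ** 2, axis=1)
    k = min(k, len(d2))
    nearest = np.argpartition(d2, k - 1)[:k]
    abs_dz = np.abs(valid_zphot[nearest] - valid_zspec[nearest])
    # Assumption: 68% width = the |dz| below which 68% of neighbors fall.
    return float(np.quantile(abs_dz, 0.68))

# Synthetic validation set in which the photo-z scatter grows toward faint r.
rng = np.random.default_rng(2)
valid_r = rng.uniform(17.0, 22.0, size=(5000, 1))
true_sigma = 0.01 + 0.02 * (valid_r[:, 0] - 17.0)
valid_zspec = rng.uniform(0.05, 0.6, size=5000)
valid_zphot = valid_zspec + rng.normal(size=5000) * true_sigma

err_bright = nne_error(np.array([17.5]), valid_r, valid_zphot, valid_zspec)
err_faint = nne_error(np.array([21.5]), valid_r, valid_zphot, valid_zspec)
```

In this toy setup the faint galaxy receives a larger error estimate than the bright one, reflecting the correlation between photo-z error and magnitude that motivates defining the distance in magnitude space.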

4.3. Estimating the Redshift Distribution 

As we shall see in §5.1, estimates for galaxy photo-z's 
suffer from statistical biases that in general cannot be 
completely removed on an object-by-object basis. How- 
ever, we can seek an unbiased estimate of the true red- 
shift distribution for the photometric sample that is in- 
dependent of individual galaxy photo-z estimates. For 
some statistical applications, the redshift distribution of 
the photometric sample, as opposed to individual galaxy 
photo-z's, is all that is required. One way to estimate 
this distribution is to assign a weight to every galaxy in 
the spectroscopic sample such that the weighted spectro- 
scopic sample has the same distributions of magnitudes 
and colors as the photometric sample. The $z_{\rm spec}$ distribution 
of this weighted spectroscopic sample provides an 
estimate of the true, underlying redshift distribution of 
the photometric sample. 

The weight $W_\alpha$ of the $\alpha$-th spectroscopic galaxy is 
calculated by comparing the local density around the galaxy 
in the spectroscopic sample with the density of the 
corresponding region in the photometric sample. The local 
density is evaluated by counting the number of nearest 
neighbors using the distance measured in the space of 
photometric observables, as in Eq. (5). We fix the 
number of spectroscopic neighbors, $N_S$, which determines the 
distance $d_{\max}$ to the $N_S$-th nearest spectroscopic neighbor. 
We then find the number of neighbors $N_P$ in the 
photometric sample within the same distance $d_{\max}$ of the 
spectroscopic galaxy. Up to an arbitrary normalization 
factor, the weight is defined as 

$$W_\alpha \equiv \frac{N_P}{N_S} . \qquad (6)$$ 

For our estimates, we chose $N_S = 20$, which provides a 
good match of the weighted spectroscopic distributions 
of magnitudes and colors to those of the photometric 
sample. We note that if additional cuts in magnitude or 
color are applied to the photometric sample, then this 
procedure must be repeated for the newly selected 
photometric sample. More details and tests of this method 
and comparisons with other methods for estimating the 
underlying redshift distribution (e.g., deconvolving the 
error distribution from the $z_{\rm phot}$ distribution) will be 
presented separately (Lima et al. 2007). 
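
The weighting procedure can be sketched in a few lines. This is an illustration, not the authors' code: the function name `neighbor_weights` is hypothetical, the search is brute force (a tree-based neighbor search would be needed at catalog scale), and excluding the galaxy itself from its own $N_S$ spectroscopic neighbors is an assumption.

```python
import numpy as np

def neighbor_weights(spec_obs, phot_obs, n_s=20):
    """Weights W = N_P / N_S for each spectroscopic galaxy, normalized to sum to 1.

    spec_obs, phot_obs: arrays of shape (n_galaxies, n_observables).
    Requires more than n_s spectroscopic galaxies.
    """
    weights = np.empty(len(spec_obs))
    for a, point in enumerate(spec_obs):
        d2_spec = np.sum((spec_obs - point) ** 2, axis=1)
        # Sorted position 0 is the galaxy itself (distance 0), so position
        # n_s is its n_s-th nearest spectroscopic neighbor.
        d2_max = np.partition(d2_spec, n_s)[n_s]
        d2_phot = np.sum((phot_obs - point) ** 2, axis=1)
        n_p = np.count_nonzero(d2_phot <= d2_max)
        weights[a] = n_p / n_s
    return weights / weights.sum()

# Toy case: spectroscopic sample uniform on [0, 1], photometric on [0.5, 1].
rng = np.random.default_rng(3)
spec = rng.uniform(0.0, 1.0, size=(1000, 1))
phot = rng.uniform(0.5, 1.0, size=(3000, 1))
w = neighbor_weights(spec, phot)
weighted_mean = np.sum(w * spec[:, 0])
```

In this toy case spectroscopic galaxies outside the region covered by the photometric sample receive essentially no weight, and the weighted mean of the spectroscopic observable moves toward that of the photometric sample.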

5. RESULTS 
5.1. Photometric redshifts 

The photo-z precision (variance) and accuracy (bias) 
are limited by a number of factors. There are intrinsic degeneracies 
in magnitude-redshift space: low-luminosity, 
intrinsically red galaxies at low redshift can have ap- 
parent magnitudes similar to those of high-luminosity, 
intrinsically blue galaxies at high redshift. This natu- 
ral degeneracy is amplified by photometric errors, since 
magnitude uncertainties propagate to photo-z errors. In 
addition to these observational limitations, which are de- 
termined by the photometric precision and the number 
of passbands of a survey, the photo-z estimator itself may 
have inherent limitations. For example, for training set 
methods, the size and representativeness of the training 
set are important factors, as are the number of parame- 
ters or weights in the fitting functions. 

To test the quality of the photo-z estimates, we use
four photo-z performance metrics. The first two metrics
are the photo-z bias, z_bias, and the photo-z rms scatter,
σ, both averaged over all N objects in the validation set,
defined by



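z_{\rm bias} = \frac{1}{N} \sum_{i=1}^{N} \left( z_{\rm phot}^{i} - z_{\rm spec}^{i} \right) , (7)

\sigma^{2} = \frac{1}{N} \sum_{i=1}^{N} \left( z_{\rm phot}^{i} - z_{\rm spec}^{i} \right)^{2} . (8)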

The third performance metric, denoted by σ_68, is the
range containing 68% of the validation set objects in the
distribution of δz ≡ z_phot - z_spec. This metric is useful
because the probability distribution function P(δz) is in
general non-Gaussian and asymmetric (for a Gaussian
distribution, σ and σ_68 coincide). Explicitly, σ_68 is defined
by the value of |z_phot - z_spec| such that 68% of the
objects have |z_phot - z_spec| < σ_68. We also use the 95%
region σ_95, defined similarly. In addition to these global
metrics, we also define local versions of them in bins of
redshift or magnitude.
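
The four metrics can be computed directly from paired lists of photometric and spectroscopic redshifts. The following Python sketch is illustrative: the function name photoz_metrics and the quantile convention (the smallest |δz| that encloses at least the requested fraction of objects) are assumptions, since the text does not specify how the 68% and 95% ranges are interpolated.

```python
import math


def _enclosing_value(sorted_abs, fraction):
    """Smallest value that encloses at least `fraction` of the sorted inputs."""
    n = len(sorted_abs)
    k = min(n - 1, max(0, math.ceil(fraction * n) - 1))
    return sorted_abs[k]


def photoz_metrics(z_phot, z_spec):
    """Return (z_bias, sigma, sigma_68, sigma_95) for paired redshift lists."""
    dz = [p - s for p, s in zip(z_phot, z_spec)]
    n = len(dz)
    z_bias = sum(dz) / n                            # Eq. (7)
    sigma = math.sqrt(sum(d * d for d in dz) / n)   # Eq. (8)
    abs_dz = sorted(abs(d) for d in dz)
    sigma_68 = _enclosing_value(abs_dz, 0.68)
    sigma_95 = _enclosing_value(abs_dz, 0.95)
    return z_bias, sigma, sigma_68, sigma_95
```

The local versions of the metrics follow by applying the same function to the subset of objects falling in each bin of redshift or magnitude.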

To search for an optimal photo-z estimator, we computed
photo-z's using the ANN method with different
combinations of input photometric observables. Five of
these combinations are listed in Table 2. In the first case,
dubbed O1, the training and photo-z estimation are carried
out using only the five magnitudes ugriz. In case



[Figure 4: 3 x 3 grid of scatter panels with z_spec on the horizontal axis. Rows: ANN D1, ANN CC2, NNP; columns: r < 20, r > 20, All. Panel annotations include: D1, r < 20: σ_68 = 0.018, σ = 0.026; D1, r > 20: σ_68 = 0.070, σ = 0.134; NNP, r < 20: σ_68 = 0.018, σ = 0.027; CC2, All: σ_68 = 0.024, σ = 0.059; NNP, All: σ_68 = 0.020, σ = 0.054.]



Fig. 4. z_phot versus z_spec for the validation set for different ranges of r magnitude and for different photo-z techniques. Left column:
objects with r < 20; middle column: objects with r > 20; right column: all objects. Top row: ANN case D1, where the input photometric
data comprise the 5 magnitudes (ugriz) and the 5 concentration parameters, and the training is split into 5 bins of r magnitude. Middle
row: ANN case CC2, where the input data are the 4 colors u - g, g - r, r - i, i - z, and 3 concentration parameters c_g, c_r, c_i. Bottom row:
results for the NNP method, where the input data are the 5 magnitudes and 5 concentration parameters. In all cases, the photo-z methods
used a training set with ~ 320,000 objects, and the derived solutions were applied to an independent validation set with ~ 309,000 objects
and r < 22, reflecting the magnitude limit of the photometric sample. The solid line in each panel indicates z_phot = z_spec; the dashed and
dotted lines show the 68% and 95% confidence regions as a function of z_spec. The points display results for a random 10% subset of the
validation set in each magnitude range.



TABLE 2
Summary of ANN cases

Case   Inputs/Description                             σ        σ_68
O1     ugriz                                          0.0525   0.0229
C1     ugriz + c_u c_g c_r c_i c_z                    0.0519   0.0224
D1     ugriz + c_u c_g c_r c_i c_z. Split training    0.0519   0.0209
CC1    u - g, g - r, r - i, i - z                     0.0668   0.0272
CC2    u - g, g - r, r - i, i - z + c_g c_r c_i       0.0593   0.0245

Note. Photo-z performance metrics σ and σ_68 for the validation
set using different input parameters (magnitudes, colors,
and concentration indices) and training procedures.



C1, we use the five magnitudes and the five concentration
indices c_u, c_g, c_r, c_i, c_z as the input parameters. In case
CC1, we use only the four colors u - g, g - r, r - i, and
i - z. In case CC2, we combine the four colors with the
concentration indices c_g, c_r, c_i in the gri filters. Finally, in
case D1, we use the ugriz magnitudes and the c_u, c_g, c_r, c_i, c_z
concentration indices, but we split the training set and
the photometric sample into 5 bins of r magnitude and
perform separate ANN fits in each bin. In all five cases,
we use an ANN with three hidden layers and tune the
number of hidden nodes to keep the total number of degrees
of freedom of the network roughly the same for all
cases.

Table 2 provides a summary of the performance results
of the different ANN cases. We find that using concentration
indices in addition to magnitudes (C1 vs. O1) helps
break some degeneracies and reduces the photo-z scatter
by a few percent. Using only colors (CC1) degrades the
photo-z performance by as much as 20%, mostly because
the degeneracy between intrinsically red, nearby galaxies
and intrinsically blue, distant galaxies (with red observed
colors) cannot be broken. Adding concentration
indices to color-only training (CC2) helps break such a

Fig. 5. The performance metrics z_bias, σ, and σ_68 for the ANN
D1 and CC2 validation sets are shown as a function of r magnitude.
CC2 performs relatively poorly for bright objects (r < 16), where
the color-redshift relation is contaminated by faint objects with
similar colors. In D1, this problem is alleviated by the effective
magnitude prior imposed by the training set. At faint magnitudes,
the performance degrades as the photometric errors increase.

degeneracy, because the concentration index correlates
with galaxy type and hence intrinsic color. Of the five,
case CC2 also yields the most realistic photometric redshift
distribution for the photometric sample (see §5.2).
Finally, splitting the training set and photometric sample
into magnitude bins (D1) produces results with the
best performance metrics (σ and σ_68) of all the ANN
cases we have tested. We choose D1 and CC2 as the
best ANN cases and describe their results in more detail
below; their outputs for the photometric sample are
included in the public DR6 database.

In Fig. 4 we plot photometric redshift, z_phot, for all
objects in the validation set vs. true spectroscopic redshift,
z_spec, for the different photo-z methods and cases
and in different ranges of r magnitude. The top row
shows results for ANN case D1, the middle row shows
the performance of ANN case CC2, and the bottom row
shows results for the NNP method using magnitudes and
concentration indices as the input parameters. In each
panel, the values of the corresponding global photo-z performance
metrics σ and σ_68 are shown. The redshift bias
z_bias is typically much smaller than σ or σ_68, since the
photo-z methods are designed to minimize it (see Fig. 5).
In each panel of Fig. 4 the solid line traces z_phot = z_spec,
i.e., the line for a perfect photo-z estimator. The dashed
and dotted lines show the corresponding 68% and 95%
regions, defined as above but in z_spec bins. Although
each photo-z method probes the hypersurface defined by
the photometric observables and redshift in a different
way, they produce very similar results, suggesting that
our results are limited not by the photo-z technique employed
but by the intrinsic degeneracies in
magnitude-concentration-redshift space and by the photometric errors.

In Figs. 5 and 6 we show the performance metrics
z_bias, σ, and σ_68 as a function of r magnitude and z_spec
for the validation set for the two preferred ANN cases.
We see that the photo-z precision degrades considerably
for objects with r > 20. This increased scatter is expected,
since the relative photometric errors increase as
the nominal detection limit of the SDSS photometry is
approached (see Table 1). While the bias for CC2 increases
at r < 17, we note that the fraction of objects
in the photometric sample that are that bright is very
small. As a function of redshift, σ and σ_68 increase dramatically
beyond z ~ 0.6 for the validation set. For the
r < 20 part of the sample, the number of spectroscopic
objects with z > 0.6 is simply too small to characterize
the redshift-magnitude surface, as shown in the left panel
of Fig. 7. For the faint objects (r > 20), the scatter is
low for z between 0.4 and 0.6 and increases outside of
that range. It is important to note that the photo-z performance
metrics were calculated independently of spectral
type. Since the neural network and the training
set were not optimized for any specific galaxy population
(e.g., galaxies in clusters), it is possible that certain
galaxy types may have photo-z's with worse (or better!)
biases and dispersion.

In Figure 7 we plot g - r color versus spectroscopic redshift
for the validation set for both bright (r < 20) and
faint (r > 20) galaxies. The 2SLAQ and DEEP2 galaxies
are highlighted by different colors (shades of grey), and
the expected color-redshift relations for the four spectral
templates from Coleman et al. (1980) (from early
to late types) are indicated by the solid lines. We see
that for the faint sample, in the range 0.4 < z < 0.6,
the galaxies come mostly from the 2SLAQ survey, which
used specific color cuts to select early-type galaxies at
z ~ 0.5. Because early-type galaxies have a well-defined
4000 Å break feature, their photo-z's are well determined
and their photo-z scatter is low. Outside of the range
0.4 < z < 0.6, the validation set at faint magnitudes
is dominated by bluer galaxies that do not have strong,
broad spectral features, resulting in the larger photo-z
scatter seen in Fig. 6.

Fig. 6 shows that the common assumption that the
photo-z scatter scales as (1 + z) is not consistent with 
our estimates for the SDSS sample. The functional form 
of the scatter versus redshift depends strongly on the 
underlying galaxy type distribution. 

5.2. Redshift Distributions 

So far, we have considered the scatter and bias of
photo-z estimates. As discussed in §4.3, it is also of interest
to consider the predicted photo-z distribution as
a whole. Different photo-z estimators may achieve similar
values for the metrics z_bias, σ, and σ_68, but predict
different forms for the photo-z distribution of the photometric
sample. As we shall see, this is the case with the
two ANN cases D1 and CC2. We therefore define two
additional performance metrics to quantify the quality
of the predicted photo-z distribution. The first metric,
σ_dist, measures the rms difference between the binned
z_phot and z_spec distributions of the validation set,



\sigma_{\rm dist}^{2} = \frac{1}{N_{\rm bin}} \sum_{i=1}^{N_{\rm bin}} \left( P_{\rm phot}^{i} - P_{\rm spec}^{i} \right)^{2} , (9)

Fig. 6. Performance metrics z_bias, σ, and σ_68 for the ANN D1 and CC2 validation sets are shown as a function of z_spec for r < 20 and
r > 20. The increased scatter for objects with z > 0.6 is due to the 4000 Å break shifting out of the r passband at around z = 0.7; beyond
that redshift, the estimator effectively relies on only two passbands (i and z) to determine the photo-z's. Note that faint objects (r > 20)
have worse scatter at low redshifts for both cases. This is likely due to the fact that the faint, low-redshift objects in the validation set are
predominantly blue dwarf or irregular galaxies that do not have strong 4000 Å breaks; in this case, the photo-z estimator must rely on less
pronounced spectral features, resulting in larger photo-z scatter.




Fig. 7. g - r color vs. spectroscopic redshift for galaxies in the validation set. Left panel: galaxies with r < 20; right panel: galaxies with
r > 20. The solid curves show expected color-redshift relations of galaxies with different SED types, calculated using the Coleman et al.
(1980) spectral templates. The different colors (shades of grey) indicate galaxies from the different spectroscopic surveys contributing to
the validation set. The 2SLAQ objects, denoted by red triangles, were selected to be mostly early-type galaxies. They are responsible for
the minimum in σ vs. z_spec for the r > 20 subsample in Fig. 6.



where P^i_phot is the height of the i-th redshift bin of the
z_phot distribution, P^i_spec is the height of the same redshift
bin of the z_spec distribution, and N_bin is the total number
of redshift bins used. Here we use N_bin = 120 equally
spaced redshift bins running from z = 0 to z = 1.2.

The second redshift distribution metric we employ is
the KS statistic D, the maximum value of the absolute
difference between the two (z_phot and z_spec) cumulative
redshift distribution functions. An advantage of the KS
statistic is that it does not require binning the data in
redshift. However, our use of the KS statistic to quantify
the difference between the z_phot and z_spec distributions of
the validation set likely does not adhere to formal statistical
practice, since it turns out that the probability for
the KS statistic for both cases we consider is very close

TABLE 3
σ_dist and KS statistic for redshift distribution

                  σ_dist               KS statistic
r-mag bin         CC2      D1          CC2      D1
r < 18            0.0392   0.0330      0.0632   0.0391
18 < r < 19       0.0390   0.0430      0.0520   0.0533
19 < r < 20       0.0391   0.0399      0.0366   0.0413
20 < r < 21       0.0403   0.0471      0.0363   0.0665
21 < r < 22       0.0652   0.0702      0.1051   0.1306
All               0.0383   0.0338      0.0485   0.0307

Note. σ_dist and KS statistic results for CC2 and D1
ANN photo-z's for the validation set.



to zero (Press et al. 1992).
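
Both distribution metrics are simple to evaluate for any pair of redshift samples. The Python sketch below is illustrative: the function names are hypothetical, and the choice to normalize each binned distribution to unit area before differencing is an assumption, since the normalization of the bin heights is not stated in the text.

```python
import math


def binned_density(z, n_bin=120, z_min=0.0, z_max=1.2):
    """Histogram of z normalized to unit area (assumed normalization)."""
    width = (z_max - z_min) / n_bin
    counts = [0] * n_bin
    for v in z:
        k = math.floor((v - z_min) / width)
        if 0 <= k < n_bin:
            counts[k] += 1
    total = sum(counts)
    return [c / (total * width) for c in counts] if total else [0.0] * n_bin


def sigma_dist(z_phot, z_spec, n_bin=120, z_min=0.0, z_max=1.2):
    """RMS difference between the binned z_phot and z_spec distributions."""
    p_phot = binned_density(z_phot, n_bin, z_min, z_max)
    p_spec = binned_density(z_spec, n_bin, z_min, z_max)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p_phot, p_spec)) / n_bin)


def ks_statistic(a, b):
    """Maximum absolute difference between two empirical cumulative distributions."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both samples past x so that ties are handled together.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d
```

Only the statistic D is computed here; no significance level is attached to it, in keeping with the caveat above about formal use of the KS test.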

Table 3 shows the values of σ_dist and of the KS statistic
D for the validation set for the D1 and CC2 ANN photo-z's,
for different ranges of r magnitude. Although the
CC2 photo-z distribution is a worse overall match to the
z_spec distribution for the validation set, it works better
than D1 for r > 18. Since the photometric sample is
dominated by objects at r > 20 (see Fig. 1), these results
suggest that CC2 should do a better job in estimating
the redshift distribution of the photometric sample, even
though D1 performs better by the standards of z_bias and
σ.

The redshift distributions for the validation set are
shown in Fig. 8 for the same bins of r magnitude as in
Table 3. The D1 and CC2 z_phot distributions are shown
in color, and the solid curves correspond to the z_spec distributions.
The similarities between the z_phot and z_spec
distributions are consistent with the results of Table 3.



In §4.3 we noted that the z_spec distribution of the spectroscopic
sample, weighted to reproduce the color and
magnitude distributions of the photometric sample, provides
an estimate of the unknown redshift distribution of
the photometric sample. The z_phot distribution for the
photometric sample, computed using ANN D1 or CC2,
provides another estimate of the true redshift distribution
for the photometric sample, but one that we know
suffers from bias (e.g., Fig. 5). While we have not shown
that the weighted z_spec estimate of the redshift distribution
is unbiased, it has the advantage that it makes
direct use of the statistical properties of the photometric
sample, and we believe it is our best estimate of the photometric
sample redshift distribution. Our final test of
photo-z performance therefore compares the z_phot distribution
for the photometric sample for the two ANN cases
with the weighted z_spec distribution of the spectroscopic
sample. Agreement between the weighted z_spec distribution
and either one of the z_phot distributions does not
guarantee that they are correct, but it at least provides
a useful consistency check.

In Fig. 9 we show the estimated redshift distributions
of a random subsample containing ~ 1% of the objects in
the DR6 photometric sample for both the CC2 and D1
ANN cases. The colored regions correspond to the z_phot
distributions, and the solid lines indicate the weighted
z_spec distribution of the spectroscopic sample. The z_phot
distributions for CC2 are closer matches to the weighted
z_spec distributions for r > 18, and they do not show
the peculiar features that the D1 photo-z distributions
display, particularly at faint magnitudes. By the criterion
of producing a more realistic redshift distribution
for the photometric sample, the CC2 ANN estimator is
preferred.

5.3. Photo-z Errors 

In order to test the quality of our photo-z error estimates
calculated with the NNE method, we introduce
the concept of empirical error. For a set of objects
(within the validation set) with similar NNE error, σ_z^NNE,
the empirical error is defined as the 68% width of the
|z_phot - z_spec| distribution for the set. If the NNE estimator
works properly, objects with similar NNE error
should have similar underlying error distributions, i.e.,
the NNE error should correlate well with the empirical
error.

Fig. 10 shows the performance of the photo-z error estimator
by plotting the computed NNE error σ_z^NNE as a
function of the corresponding empirical error for the validation
set. Results are shown for the D1 and CC2 ANN
photo-z's. The empirical error was calculated for bins
containing 100 objects with similar σ_z^NNE. As expected,
faint objects (r > 20) have larger errors than bright objects
(r < 20). The NNE estimated error correlates well
with the empirical error even for the faint objects, indicating
that the error estimator works properly for all
magnitudes. The bulk of the bright objects have σ_z^NNE
in the range 0.01-0.04, consistent with the overall rms
photo-z scatter of σ ~ 0.03 indicated in Fig. 4. Likewise,
faint objects have σ_z^NNE in the range 0.02-0.3, while
σ ~ 0.13 for those objects. The NNE error is therefore
a robust indicator of an object's photo-z quality. In particular,
we have carried out tests in which we cut objects
with large NNE error from the sample and found that the
remaining sample has smaller photo-z scatter and fewer
catastrophic outliers. For applications in which photo-z
precision is more important than completeness of the
photometric sample, this can be a useful procedure.
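
The comparison between NNE and empirical errors can be reproduced for any validation set along the following lines. This Python sketch is illustrative: the function names are hypothetical, the representative NNE error of a bin is taken to be the bin mean, a trailing incomplete bin is dropped, and the 68% width uses the smallest |z_phot - z_spec| enclosing at least 68% of the bin; all three choices are assumptions not specified in the text.

```python
import math


def empirical_vs_nne(z_phot, z_spec, sigma_nne, bin_size=100):
    """Return (mean NNE error, empirical 68% error) for bins of similar NNE error."""
    rows = sorted(zip(sigma_nne, z_phot, z_spec))  # order objects by NNE error
    k = math.ceil(0.68 * bin_size) - 1             # index of the 68% point
    points = []
    for start in range(0, len(rows) - bin_size + 1, bin_size):
        chunk = rows[start:start + bin_size]
        mean_nne = sum(e for e, _, _ in chunk) / bin_size
        abs_dz = sorted(abs(p - s) for _, p, s in chunk)
        points.append((mean_nne, abs_dz[k]))
    return points


def cut_on_nne(z_phot, z_spec, sigma_nne, max_error):
    """Keep only objects whose NNE error is below max_error."""
    return [(p, s, e) for p, s, e in zip(z_phot, z_spec, sigma_nne) if e < max_error]
```

If the error estimator behaves as described, the returned points should scatter about the one-to-one line, and applying cut_on_nne with a smaller max_error should reduce the scatter of the surviving sample at the cost of completeness.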

In Fig. 11 we plot the normalized error distribution,
i.e., the distribution of (z_phot - z_spec)/σ_z^NNE, for objects in
the spectroscopic sample, using the D1 ANN estimator.
The solid black lines are the data, and the dotted red
lines show Gaussian distributions with zero mean and
unit variance. The upper panels show results for the
galaxies in the SDSS Main and LRG spectroscopic samples.
The lower panels show results for all validation-set
galaxies, divided into bright (r < 20) and faint (r > 20)
samples. These plots indicate that, averaged over the
bulk of the spectroscopic sample, the photo-z estimates
are nearly unbiased, the NNE error provides a good estimate
of the true error, and the NNE error can be approximately
interpreted as a Gaussian error in this average
sense. Note that this does not imply that the photo-z
error distributions in bins of magnitude or redshift are
unbiased Gaussians: Figs. 5 and 6 show that they are
not.
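
A quick numerical counterpart to this visual comparison is to summarize the normalized errors by their mean, standard deviation, and the fraction lying within one unit of zero; for a unit Gaussian these would be close to 0, 1, and 0.68. The Python sketch below is illustrative, with hypothetical function names, and it skips objects with a non-positive error estimate as an assumption.

```python
import math


def normalized_errors(z_phot, z_spec, sigma_nne):
    """(z_phot - z_spec) / sigma_nne for objects with a positive error estimate."""
    return [(p - s) / e for p, s, e in zip(z_phot, z_spec, sigma_nne) if e > 0]


def pull_summary(pulls):
    """Return (mean, standard deviation, fraction with |pull| < 1)."""
    n = len(pulls)
    mean = sum(pulls) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in pulls) / n)
    frac = sum(1 for x in pulls if abs(x) < 1.0) / n
    return mean, std, frac
```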



6. QUERY FLAGS AND CAVEATS 

When querying the SDSS data server to produce the 
photometric sample for which we estimated photo-z's, 
we set the most relevant flags needed to produce a clean 
galaxy sample. However, some applications may require 
more stringent selection of objects. We advise users of 
the catalog to read the documentation about producing 






Fig. 8. Redshift distributions for the galaxies in the validation set for different r magnitude bins. Left panels: ANN D1; right panels:
ANN CC2. The colored regions indicate the ANN photo-z distributions, while the lines are the spectroscopic redshift distributions. By
eye, both ANN cases recover the true redshift distributions of the validation set well, except in the faintest magnitude bin, where the
photometric errors become large.

Fig. 9. Estimated redshift distributions for a random subsample of 1% of the galaxies in the DR6 photometric sample in different
r-magnitude bins. Left panels: ANN D1; right panels: ANN CC2. Colors show the z_phot distributions. The lines show the estimated
redshift distributions from the spectroscopic sample weighted to match the magnitude and color distributions of the photometric sample.
Even though the two ANN cases correctly recover the validation set redshift distribution (Fig. 8), their photo-z distributions for the
photometric sample disagree. The photo-z distribution for D1 shows a peak at z ~ 0.4 that results mainly from the 20 < r < 21 bin. The
CC2 distribution does not show such strong features, and in general it matches the weighted z_spec distribution better.



a clean galaxy sample on the SDSS website (footnote 7). In particular,
users should consider requiring the BINNED1 (object
detected at > 5σ) flag and removing objects with the
NODEBLEND (object is a blend but deblending was not
possible) flag. The various PHOTO flags are described in
more detail at the above website as well as in Appendix A.



Finally, we note that the training of the photo-z estimators
included only galaxies, not stars. As a result,
photo-z estimates for stars that contaminate the photometric
sample will be wrong, and cutting objects with
low z_phot will not remove them. Our tests on star/galaxy
separation in the photometric sample are briefly described
in Appendix B.



7 http://cas.sdss.org/dr6/en/help/docs/algorithm.asp



Fig. 10. The estimated error from the NNE method, σ_z^NNE, is shown against the empirical error for objects in the validation set. Left
panel: D1 ANN; right panel: CC2 ANN. Each point corresponds to a bin of 100 objects with similar σ_z^NNE. The black squares show results
for bright objects (r < 20), the red triangles for faint objects (r > 20). As expected, faint objects have larger errors, but the NNE error
correlates well with the empirical error over the full magnitude range.

Fig. 11. Distributions of (z_phot - z_spec)/σ_z^NNE for objects in
the spectroscopic sample, with photo-z's calculated using ANN D1;
the results for ANN CC2 are very similar. The solid black lines are
the data, and the dotted red lines are Gaussians with zero mean
and unit variance. Top left: SDSS Main spectroscopic sample; top
right: SDSS LRG sample; bottom left: validation-set galaxies with
r < 20; bottom right: validation-set galaxies with r > 20. In all
cases the photo-z errors are reasonably well modeled by Gaussian
distributions.

The photo-z catalog can be accessed from the photoz2
table in the DR6 context on the SDSS CasJobs site, at
http://casjobs.sdss.org/casjobs/. A query similar
to the one in the Appendix provides all objects for which
we computed photo-z's. Alternatively, one can simply
perform a query that searches for objects with a photoz2
entry.



In addition to the photoz2 table in the SDSS
CAS, an independent photoz table is also available,
for which the photo-z's have been computed using
a template-based technique; see Csabai et al. (2007);
Adelman-McCarthy et al. (2007a).

8. CONCLUSIONS 

We have presented a public catalog of photometric redshifts
for the SDSS DR6 photometric sample using two
different photo-z estimates, CC2 and D1, based on the
ANN method. As a consistency check, we have also
calculated photo-z's using the NNP method, a nearest
neighbor approach, which gives very good agreement
with the ANN results. The CC2 and D1 photo-z results
are comparable. For the validation set, the D1 photo-z
estimates have lower photo-z scatter for bright galaxies
(r < 20), and scatter similar to but slightly smaller than
that of CC2 for objects with r > 20. Our tests indicate
that the SDSS photo-z estimates are most reliable for
galaxies with r < 20 and that the scatter increases significantly
at fainter magnitudes. For faint galaxies (r > 20),
we recommend using the CC2 photo-z estimate, since the
CC2 z_phot distribution most closely resembles the z_spec
distribution for the validation set and the weighted z_spec
estimate for the redshift distribution of the photometric
sample. For users who wish to use, for simplicity, a single
photo-z estimator over the full magnitude range, we
recommend using CC2.

Finally, we have demonstrated that the NNE error esti- 
mator, included in the public catalog, provides a reliable 
measure of the photo-z errors and that the overall scaled 
photo-z errors are nearly Gaussian.

APPENDIX
DATA QUERY CODE 

Here we provide the SDSS database query used to obtain part of the catalog containing the photometric sample 
used in this paper. Notice that the query requires the TYPE flag to be set to 3 (galaxies) and selects objects with 
dereddened model magnitude r < 22.0 to reflect the SDSS nominal detection limit. The query to obtain objects with 
Right Ascension (RA) in the range [0, 170) is 



-- Bit masks for the PHOTO flags used to reject bad detections
declare @BRIGHT bigint set @BRIGHT=dbo.fPhotoFlags('BRIGHT')
declare @SATURATED bigint set @SATURATED=dbo.fPhotoFlags('SATURATED')
declare @SATUR_CENTER bigint set @SATUR_CENTER=dbo.fPhotoFlags('SATUR_CENTER')
declare @bad_flags bigint set @bad_flags=(@SATURATED | @SATUR_CENTER | @BRIGHT)

-- Dereddened magnitudes and Petrosian radii for galaxies with r <= 22
select
objID, ra, dec, type, dered_u, dered_g, dered_r, dered_i, dered_z,
petroR50_u, petroR50_g, petroR50_r, petroR50_i, petroR50_z,
petroR90_u, petroR90_g, petroR90_r, petroR90_i, petroR90_z
into MyDb.all_ra_0_170
FROM PhotoPrimary
WHERE ((flags & @bad_flags)) = 0 AND (dered_r<=22.0) AND (ra>=0.0) AND (ra<170.0)
AND (type = 3)

Here we provide a brief description of the flags used in the query: BRIGHT indicates that an object is a duplicate
detection of an object with signal to noise greater than 200σ; SATURATED indicates that an object contains one
or more saturated pixels; SATUR_CENTER indicates that the object center is close to at least one saturated pixel.
Note that in selecting PRIMARY objects (using PhotoPrimary), we have implicitly selected objects that either do not
have the BLENDED flag set or else have NODEBLEND set or nchild equal to zero. In addition, the PRIMARY catalog
contains no BRIGHT objects, so the cut on BRIGHT objects in the query above is in fact redundant. BLENDED
objects have multiple peaks detected within them, which PHOTO attempts to deblend into several CHILD objects.
NODEBLEND objects are BLENDED but no deblending was attempted on them, because they are either too close
to an EDGE, or too large, or one of their children overlaps an edge. A few percent of the objects in our photometric
sample have NODEBLEND set; some users may wish to remove them.

We also suggest that users require objects to have the BINNED1 flag set. BINNED1 objects were detected at > 5σ
significance in the original imaging frame.

The SDSS webpage (footnote 8) provides further recommendations about flags, which we strongly recommend that users read.

TESTS ON STAR-GALAXY SEPARATION 

We used the SDSS database TYPE flag to select the galaxy photometric sample for our photo-z catalogs. To study 
the robustness of the TYPE flag in separating galaxies from stars, we also carried out tests using an independent 



8 http://cas.sdss.org/dr5/en/help/docs/algorithm.asp?key=flags

star-galaxy classifier. Here we briefly describe both of these techniques and show the results obtained on photometric
and spectroscopic samples.

The TYPE flag is based on the star-galaxy separator in the SDSS PHOTO pipeline, described in Lupton et al. (2001)
and updated in Abazajian et al. (2004). For a given object, the pipeline computes the PSF and cmodel magnitudes in
each passband (footnote 9), where the cmodel magnitude is a measure of the flux using a composite of the best-fit de Vaucouleurs
and exponential models of the light profile. If the condition

m_{\rm PSF} - m_{\rm cmodel} > 0.145 (B1)

is satisfied, type is set to GALAXY for that band; otherwise, type is set to STAR. The object's global TYPE is
determined by the same criterion, but now applied to the summed PSF and cmodel fluxes from all passbands in which
the object is detected. Lupton et al. (2001) show that an earlier version of this simple cut works at the 95% confidence
level for SDSS objects brighter than r = 21.

The second star-galaxy separator we tested is the galaxy probability defined in Scranton et al. (2002). The galaxy
probability (hereafter probgals) is a Bayesian probability estimate that an object is a galaxy (and not a star), given
the object's magnitudes and concentration parameter. Here the concentration parameter is not the ratio of Petrosian
radii but is defined as the difference between an object's PSF and exponential-model r magnitudes. This concentration
parameter is close to zero for stars, is positive for bright galaxies, and approaches zero as galaxies become fainter.

We conducted some simple tests to compare these classification schemes. If we set the Bayesian probgals threshold 
to a value between 0.5 and 0.9, then both methods agree on the classification of more than 90% of the objects for a 
random 1% subset of the SDSS photometric sample. We also tested the methods on a spectroscopic sample of 29,229 
galaxies and stars (counting independent photometric measurements of each object) from the 2SLAQ and DEEP2 
catalogs with r < 22. Defining stars as objects with z spec < 0.01, the sample contains 24,541 galaxies and 4,688 stars. 
We wish to compare this spectroscopic "truth table" with the photometric classification of the two methods and with 
a combined method that classifies an object as a galaxy if and only if both separators classify it as a galaxy. For the 
purposes of this test, we say that the Bayesian scheme classifies an object as a galaxy if probgals > 0.5. We define 
galaxy completeness as the ratio of correctly identified galaxies to the total number of galaxies in the spectroscopic 
sample. Purity is defined as the ratio of correctly identified galaxies to the number of objects identified (correctly 
or not) as galaxies by the classifier. The purity depends in part on the relative numbers of galaxies and stars in the 
spectroscopic sample. 
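
As an illustration of the definitions above, the following minimal Python sketch computes galaxy completeness and purity for the Bayesian, PHOTO TYPE, and combined classifiers. The function names and the five toy objects are hypothetical and invented for this example; they are not drawn from the 2SLAQ or DEEP2 samples, and the sketch is not the code used to produce the results in this paper.

```python
# Minimal sketch; all names and the toy data below are hypothetical.

def is_spectroscopic_galaxy(z_spec, z_star_max=0.01):
    """Truth label: objects with z_spec < 0.01 are treated as stars."""
    return z_spec >= z_star_max


def completeness_and_purity(truth, predicted):
    """Return (completeness, purity) for parallel lists of booleans.

    truth[i]     -- True if object i is a spectroscopic galaxy
    predicted[i] -- True if the classifier calls object i a galaxy
    """
    correct = sum(1 for t, p in zip(truth, predicted) if t and p)
    n_true = sum(1 for t in truth if t)
    n_pred = sum(1 for p in predicted if p)
    completeness = correct / n_true if n_true else float("nan")
    purity = correct / n_pred if n_pred else float("nan")
    return completeness, purity


# Made-up objects: (z_spec, probgals, PHOTO TYPE says "galaxy")
toy = [
    (0.35, 0.95, True),
    (0.12, 0.40, True),
    (0.0002, 0.70, False),
    (0.0001, 0.10, False),
    (0.60, 0.80, False),
]
truth = [is_spectroscopic_galaxy(z) for z, _, _ in toy]
bayesian = [p > 0.5 for _, p, _ in toy]               # probgals > 0.5
photo = [flag for _, _, flag in toy]
combined = [b and f for b, f in zip(bayesian, photo)]  # both must agree

for name, pred in [("Bayesian", bayesian), ("Photo", photo), ("Combined", combined)]:
    c, p = completeness_and_purity(truth, pred)
    print(f"{name}: completeness={c:.2f}, purity={p:.2f}")
```

Because the combined classifier accepts an object only when both separators do, its completeness can never exceed that of either separator alone, while its purity is typically at least as high.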

Fig. B1 shows the completeness and purity of the resulting galaxy catalogs in bins of r magnitude for this 
spectroscopic sample. Overall, the Bayesian separator and PHOTO TYPE produce similar results for galaxy purity and 
completeness. Moreover, the agreement between the two classification methods is quite good on an object-by-object 
basis. The Bayesian separator with probgals > 0.5 achieves slightly higher completeness and slightly lower purity. By 
varying the probgals boundary, we could improve the purity of the Bayesian galaxy sample at the expense of degrading 
its completeness. We note that the best value of probgals to use in defining a galaxy photometric sample depends 
on the scientific applications of the sample, i.e., on whether completeness or purity is the more important feature. In 
statistical applications, instead of defining a galaxy sample one can also choose to weight objects by their Bayesian 
probability (Scranton et al. 2002). 
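
One simple way to weight objects by their Bayesian probability, instead of applying a hard probgals cut, is sketched below. This is only an illustration under our own assumptions (the function names and toy values are hypothetical) and is not necessarily the procedure of Scranton et al. (2002).

```python
# Minimal sketch; function names and toy values are hypothetical.

def weighted_galaxy_count(probgals):
    """Expected number of galaxies: the sum of the galaxy probabilities."""
    return sum(probgals)


def weighted_mean(values, probgals):
    """Mean of `values`, with each object weighted by its probgals."""
    total = sum(probgals)
    if total == 0:
        raise ValueError("all weights are zero")
    return sum(v * w for v, w in zip(values, probgals)) / total


# Made-up photo-z values and galaxy probabilities for three objects.
photoz = [0.2, 0.4, 0.1]
probgals = [1.0, 0.5, 0.0]

print(weighted_galaxy_count(probgals))   # effective number of galaxies
print(weighted_mean(photoz, probgals))   # probability-weighted mean photo-z
```

An object with probgals = 0 contributes nothing to either statistic, so likely stars are down-weighted smoothly rather than removed by a threshold.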

Based on this test, we conclude that the photometric sample for which we have estimated photo-z's has better than 
90% galaxy purity. 

PHOTOMETRIC REDSHIFTS FOR SDSS DR5 

An earlier version of the photo-z catalog, produced for SDSS Data Release 5 (DR5), is publicly available on the 
SDSS DR5 website (and is also called photoz2). The methods used to construct that photo-z catalog were similar to 
the ones employed here for DR6, but the latter incorporates a number of important improvements. Here we briefly 
outline the differences between the two. We strongly recommend use of the DR6 photo-z catalog instead of the DR5 
catalog. 

The photometric galaxy sample selection has improved from DR5 to DR6, because we used more stringent cuts 
in defining the DR6 sample. The DR6 sample selection is described above in Appendix A. The DR5 photometric 
galaxy sample selection required the cmodel and model r magnitudes to lie in the ranges r_cmodel ∈ (14.0, 22.0) and 
r_model ∈ (13.5, 22.5), and also required the value of the smear polarizability (Sheldon et al. 2004) to be m_r > 0.8. Also, 
for DR5, star-galaxy separation used the Bayesian estimator (see Appendix B) with the value probgals > 0.8, while 
for DR6 we used PHOTO TYPE. The additional cuts used for the DR6 catalog have produced a cleaner and more 
reliable galaxy sample. 
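
The DR5 selection described above can be written as a single boolean test, sketched here in Python. The function and argument names are hypothetical, the magnitude ranges are taken as open intervals as read from the text, and the sketch does not reproduce the additional DR6 cuts of Appendix A.

```python
# Minimal sketch; function and argument names are hypothetical.

def passes_dr5_selection(r_cmodel, r_model, m_r, probgals):
    """Return True if an object satisfies the DR5 galaxy sample cuts.

    r_cmodel -- cmodel r magnitude, required to lie in (14.0, 22.0)
    r_model  -- model r magnitude, required to lie in (13.5, 22.5)
    m_r      -- smear polarizability, required to exceed 0.8
    probgals -- Bayesian galaxy probability, required to exceed 0.8
    """
    return (
        14.0 < r_cmodel < 22.0
        and 13.5 < r_model < 22.5
        and m_r > 0.8
        and probgals > 0.8
    )


print(passes_dr5_selection(20.5, 20.6, 0.9, 0.95))  # passes all four cuts
print(passes_dr5_selection(22.3, 22.4, 0.9, 0.95))  # fails the cmodel range
```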

The DR5 photo-z catalog included a number of flags describing the expected photo-z quality, shown in Table C1. 
These flags were based on the detection or non-detection of the object in all passbands and on the value of the r model 
magnitude. An object was classified as bright (faint) if r < 20 (r > 20). An object was flagged as "incomplete" if it 
was not detected in all five SDSS passbands. Table C1 shows the corresponding flag values and the number of objects 
assigned each flag value. For the DR6 sample, given the stricter sample selection, a very small number of objects 
would have been classified as incomplete by the definition above, and they have been removed from the sample. As a 
result, for DR6, we only supply the bright/faint flag, as shown in Table C2. 
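
The flag logic described above can be sketched in a few lines of Python. The function names are hypothetical. Two points are assumptions on our part: the value 0 for bright (and, in DR5, complete) objects is inferred from the sequence of the other flag values, and an object with r exactly equal to 20 is treated as bright, since the text defines only r < 20 and r > 20.

```python
# Minimal sketch; function names are hypothetical.

R_FAINT_LIMIT = 20.0  # faint objects have r > 20


def dr5_flag(r_model, detected_ugriz):
    """DR5 quality flag from the r model magnitude and band detections.

    detected_ugriz -- five booleans, one per passband (u, g, r, i, z)
    Encoding: 0 complete & bright (assumed), 1 incomplete & bright,
              2 complete & faint,            3 incomplete & faint.
    """
    faint = r_model > R_FAINT_LIMIT        # r == 20 counted as bright (assumption)
    incomplete = not all(detected_ugriz)   # undetected in at least one band
    return 2 * int(faint) + int(incomplete)


def dr6_flag(r_model):
    """DR6 flag: 2 for faint objects, 0 (assumed) for bright objects."""
    return 2 if r_model > R_FAINT_LIMIT else 0


all_bands = [True] * 5
missing_u = [False, True, True, True, True]
print(dr5_flag(19.2, all_bands), dr5_flag(19.2, missing_u))
print(dr5_flag(21.0, all_bands), dr5_flag(21.0, missing_u))
print(dr6_flag(19.2), dr6_flag(21.0))
```

With this encoding the DR6 flag values are a subset of the DR5 values, so a bright/faint selection on the flag reads the same way in both catalogs.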

9 http://www.sdss.org/dr5/algorithms/photometry.html 






[Figure B1: completeness (top panel) and purity (bottom panel) as a function of r magnitude. Curves: Bayesian (probgals > 0.5), Photo, and Combined.]

Fig. B1. Top panel: completeness; bottom panel: purity, for the Bayesian and PHOTO TYPE galaxy classifications as well as for a 
combination of the two, using a sample of galaxies with spectroscopic classification. Results for the Bayesian separator have the probgals 
lower bound set to 0.5. 



TABLE C1 
DR5 Catalog flag 

flag    No. of Galaxies    Object Description 

        86.1 million       All 
0       12.6 million       Complete & bright 
1       0.6 million        Incomplete & bright 
2       59.0 million       Complete & faint 
3       13.9 million       Incomplete & faint 

Note. The flag scheme for the DR5 catalog is based on object detection in some/all passbands and the r magnitude. 
Incomplete objects are undetected in at least one of the passbands (ugriz) and faint objects have r > 20. 



TABLE C2 
DR6 Catalog flag 

flag    No. of Galaxies    Object Description 

        77.4 million       All 
0       11.5 million       bright 
2       65.9 million       faint 

Note. The flag scheme for the DR6 catalog is based solely on the r magnitude: faint objects have r > 20. 



The spectroscopic training set used for the DR6 photo-z catalog has important additions compared to the one used 
for the DR5 catalog. In particular, for DR6 we added the DEEP2 spectroscopic catalog (which had since become publicly 
available), making the training set more complete at faint magnitudes. We also applied more stringent 
spectroscopic quality cuts to the training set used for DR6. 

Unlike the DR5 training set, the DR6 training set does not contain objects from the SDSS "special" plates, extra 
spectroscopic observations designed to target specific objects for various scientific studies (Adelman-McCarthy et al. 
2006). In our tests, we find that the lack of special plates does not result in any degradation of the photo-z quality. 

The photo-z algorithm also changed from DR5 to DR6: we increased the number of hidden-layer nodes in the ANN 
and we added the concentration indices to the data inputs. Our tests indicated that this leads to improved photo-z 
performance according to our metrics. In addition, the CC2 method differs from DR5 photo-z's further in that CC2 
uses only the color information and not the raw magnitudes. For general-purpose, full-sample photo-z's, we recommend 
using CC2 photo-z's over both DR5 and D1 photo-z's. Finally, we have carried out more extensive tests of the DR6 
photo-z's than were done for DR5, increasing our confidence in the robustness of the photo-z estimates. 